Today, data and its applications are growing at an exponential rate. Combined with business objectives, data can unlock new opportunities and possibilities across many domains. One sector, customer acquisition and retention, benefits greatly from diverse and voluminous data. In this project, we analyze and interpret the existing customer base of a chain of convenience stores, inferring store profitability relative to demographic factors, customer preferences, and marketing effectiveness, thereby contributing to the commercial success of each store. We aim to help the management of Convenient Food Mart through the analyses described below, which would help them run their business more profitably and sustainably. We have considered factors that cover not only store attributes but also customer behaviour while customers spend time in the stores. Since sustainability can help businesses reduce costs and improve operational efficiency, we also suggest a few ways in which management can improve the sustainability of the stores.
We will analyze the dataset and derive the following insights:
recyclable_package is a categorical column that tells us whether a store uses recyclable packaging for a specific good and its corresponding brand. Using the data in this column, we derive insights that will help management gauge the level of sustainability they have already achieved and how they can improve it.
Is inconsistency always bad? We doubt that! The beauty of our dataset is that stores in different states and cities have different features, related to store properties such as store size and store type. Based on the relevant features, we will observe and identify the profitable stores across various regions of the US, and extend this analysis by location.
promotion_name is a categorical feature in our dataset that records which promotion campaigns are run on certain goods of the corresponding brand. These campaigns are carried out through a medium (TV, radio, etc.). We will identify the most popular promotions and the most effective media types based on the number of purchases and the profits earned across various regions of the US.
This dataset contains sufficient features to analyze customers and their shopping preferences; for example, which customers (with or without kids) are likely to buy a certain store membership. By looking at the relevant features, we will derive insights and inferences on customer behavior. We will also extend this analysis by probing two pairs of relationships: one between food category and store location, the other between food category and marital status. Here, we analyze how consumer food preferences vary by location, which food types are popular, and which food categories can be promoted, based on customer marital status, to increase sales.
CFM operates on the franchise system, so it has stores all over the United States, but each store has different properties that determine its profits. Using the available data, we will investigate the factors that drive each store's performance in its demographic location.
Using the categorical and numerical variables, we will try to predict the cost of acquiring customers.
Transforming this enormous information into quality data, we thereby derive actionable intelligence to enable informed strategies and optimize commercial engagements.
The libraries used for the Python implementation are:
- pandas
- seaborn
- matplotlib
- numpy
- plotly
Note: We will keep adding libraries to the above list as we implement further.

Our dataset consists of over 60,000 records and 40 features covering customers' details, their shopping preferences, and store details of Convenient Food Mart (CFM). CFM is a chain of convenience stores in the United States (also located in some states of Mexico and Canada). The private company's headquarters are in Mentor, Ohio, and there are currently approximately 325 stores located in the US. Convenient Food Mart operates on the franchise system. The scope of our project includes analysis of US-based stores only.
Data source: Kaggle
# importing required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import warnings
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.preprocessing import OrdinalEncoder
from sklearn import cluster
from sklearn import decomposition
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import scipy.stats as stats
import re
import random
import statsmodels.api as sm
import scipy.stats as ss
warnings.filterwarnings('ignore')
data = pd.read_csv(r'C:\Users\Chaconne\Documents\学业\UMD\Assignments\704_Python\Group Project\FinalNB\media prediction and its cost.csv')
print(f'Initial shape {data.shape}')
data.head()
Initial shape (60428, 40)
| food_category | food_department | food_family | store_sales(in millions) | store_cost(in millions) | unit_sales(in millions) | promotion_name | sales_country | marital_status | gender | ... | grocery_sqft | frozen_sqft | meat_sqft | coffee_bar | video_store | salad_bar | prepared_food | florist | media_type | cost | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breakfast Foods | Frozen Foods | Food | 7.36 | 2.7232 | 4 | Bag Stuffers | USA | M | F | ... | 18670 | 5415 | 3610 | 1 | 1 | 1 | 1 | 1 | Daily Paper, Radio | 126.62 |
| 1 | Breakfast Foods | Frozen Foods | Food | 5.52 | 2.5944 | 3 | Cash Register Lottery | USA | M | M | ... | 18670 | 5415 | 3610 | 1 | 1 | 1 | 1 | 1 | Daily Paper, Radio | 59.86 |
| 2 | Breakfast Foods | Frozen Foods | Food | 3.68 | 1.3616 | 2 | High Roller Savings | USA | S | F | ... | 18670 | 5415 | 3610 | 1 | 1 | 1 | 1 | 1 | Daily Paper, Radio | 84.16 |
| 3 | Breakfast Foods | Frozen Foods | Food | 3.68 | 1.1776 | 2 | Cash Register Lottery | USA | M | F | ... | 18670 | 5415 | 3610 | 1 | 1 | 1 | 1 | 1 | In-Store Coupon | 95.78 |
| 4 | Breakfast Foods | Frozen Foods | Food | 4.08 | 1.4280 | 3 | Double Down Sale | USA | M | M | ... | 18670 | 5415 | 3610 | 1 | 1 | 1 | 1 | 1 | Radio | 50.79 |
5 rows × 40 columns
data.columns
Index(['food_category', 'food_department', 'food_family',
'store_sales(in millions)', 'store_cost(in millions)',
'unit_sales(in millions)', 'promotion_name', 'sales_country',
'marital_status', 'gender', 'total_children', 'education',
'member_card', 'occupation', 'houseowner', 'avg_cars_at home(approx)',
'avg. yearly_income', 'num_children_at_home',
'avg_cars_at home(approx).1', 'brand_name', 'SRP', 'gross_weight',
'net_weight', 'recyclable_package', 'low_fat', 'units_per_case',
'store_type', 'store_city', 'store_state', 'store_sqft', 'grocery_sqft',
'frozen_sqft', 'meat_sqft', 'coffee_bar', 'video_store', 'salad_bar',
'prepared_food', 'florist', 'media_type', 'cost'],
dtype='object')
data['sales_country'].value_counts()
USA 38892 Mexico 17572 Canada 3964 Name: sales_country, dtype: int64
A quick look at the data shows that it contains records from 3 countries - USA (65%), Mexico (29%), and Canada (6%).
To draw relevant insights from the data, we have decided to go with US data only. This will help us compare the performance of stores in the US. It also makes analysing customer behaviour easier if we focus on one country.
We see that out of about 60,000 customers, 40,000 are based in the US. We feel that this is more than sufficient to make high quality analyses and also draw meaningful conclusions from them. So, we go ahead and drop the rows pertaining to the other countries.
# keeping only US specific data
data = data[data['sales_country'] == 'USA']
Since our analysis is primarily based on stores and their customers, and we already have a lot of columns, we can safely remove some columns that may intuitively feel irrelevant. Moreover, our final aim is to come up with algorithms to predict customer acquisition costs, so there may be quite a few columns that do not have any relation with that aspect of our analysis.
After looking at all the columns, we can identify a few that we deem unnecessary for our assessment.
1) There are two columns about the number of cars a customer owns, so we drop the duplicate column.
2) The gross and net weight of a product are product details with no relevance to the customer acquisition cost, so they can be dropped. They might have been useful for inventory analysis, but that is beyond the scope of this project. The same applies to units per case, which is also dropped.
3) Whether a product is low fat could be used in studying customer behaviour, but only if all products were consumable food items, which is not the case, so we drop this column.
4) Since we have already filtered for US-specific data, the sales country column is redundant and is removed from the dataset.
#deleting redundant columns
del data['avg_cars_at home(approx).1']
del data['gross_weight']
del data['net_weight']
del data['low_fat']
del data['units_per_case']
del data['sales_country']
data.shape
(38892, 34)
To make referencing and interpretation easier, we need to properly index the dataset. To this end, we have chosen a multi-level index, first by store state and then by city.
# setting index at state and city level
data = data.set_index(keys=['store_state','store_city']).sort_index(level=[1,0])
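With this index in place, `.loc` can select a state or a (state, city) pair directly. A minimal sketch on assumed toy values:

```python
import pandas as pd

# Toy frame mirroring the store data (values are assumed for illustration)
toy = pd.DataFrame({
    'store_state': ['WA', 'WA', 'CA'],
    'store_city': ['Seattle', 'Tacoma', 'Los Angeles'],
    'Profit': [1.5, 2.0, 1.8],
})
toy = toy.set_index(['store_state', 'store_city']).sort_index()

# A (state, city) tuple selects a specific store location...
seattle = toy.loc[('WA', 'Seattle')]
# ...while a single state label selects every city within it
wa = toy.loc['WA']
```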
Our project is mainly classified as Data Analysis, since we perform analysis on various columns of our dataset in order to achieve our aim of helping the management of CFM. All six tasks focus on analysis; data cleaning is the only data-processing task we have implemented.
Although stores in several states derive considerable profit, it is equally important to align retail operations with sustainability objectives.
By analysing the stores in each state by their use of recyclable packaging, such as paper bags or cloth sacks, we intend to identify the stores that lack environmental responsibility and to suggest effective strategies for achieving this goal.
The visualization below shows, for each state (CA, OR, and WA), the number of stores committed to environmental sustainability. Observations and inferences from the graph are provided after the visualization.
#Number of stores which provide recyclable packages in each state
number_of_recyclable_packages_per_state = (data["recyclable_package"]).groupby("store_state").value_counts()[:,1]
number_of_recyclable_packages_per_state
store_state CA 4918 OR 5927 WA 10817 Name: recyclable_package, dtype: int64
#Number of stores which do not provide recyclable packages in each state
number_of_non_recyclable_packages_per_state = (data["recyclable_package"]).groupby("store_state").value_counts()[:,0]
number_of_non_recyclable_packages_per_state
store_state CA 3976 OR 4701 WA 8553 Name: recyclable_package, dtype: int64
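The two `value_counts` cells above can be collapsed into a single table with `pd.crosstab`. A minimal sketch on assumed toy records (on the real data, the state sits in the index, so pass `data.index.get_level_values('store_state')` as the first argument):

```python
import pandas as pd

# Toy purchase records (values are assumed for illustration)
toy = pd.DataFrame({
    'store_state': ['CA', 'CA', 'OR', 'WA', 'WA'],
    'recyclable_package': [1, 0, 1, 1, 0],
})

# One crosstab yields both counts at once:
# rows are states, columns are the 0/1 recyclable flag
counts = pd.crosstab(toy['store_state'], toy['recyclable_package'])
```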
#Plotting the total environmentally friendly stores in each state
graph = number_of_recyclable_packages_per_state.plot.bar(color=['lightgreen', 'lime', 'green'])
plt.xlabel("State")
plt.ylabel("Number of Stores")
plt.title("Total stores providing recyclable packages by state")
plt.xticks(rotation = "horizontal")
plt.show()
Observation: From the graph, it can be observed that stores in Washington invest significantly in recyclable packaging, whereas both Oregon and California trail behind.
Inference: Stores in California and Oregon should further align their business objectives with ecological preservation. For instance, they can invest in low-cost alternatives such as paper bags or encourage customers to bring their own bags to the stores.
With retail being an entirely consumer-facing segment, the number of transactions between the company and its customers remains high, and with each transaction involving a physical product, a shopping bag has to be provided. Not all stores commit to providing recyclable packaging, but many do, and the number varies across cities. The analysis here shows how many stores provide recyclable packaging in each city.
The analysis and visualization below show the number of stores in each city that do and do not provide recyclable packaging. Moreover, prior to the research we believed there might not be many stores providing recyclable packaging, as the data is quite old. Surprisingly, more stores provide recyclable materials than do not.
#Total stores with recyclable packages across each city
number_of_recyclable_packages_per_city = (data["recyclable_package"]).groupby("store_city").value_counts()[:,1]
number_of_recyclable_packages_per_city
store_city Bellingham 426 Beverly Hills 2280 Bremerton 1971 Los Angeles 2208 Portland 2893 Salem 3034 San Francisco 430 Seattle 2813 Spokane 2449 Tacoma 3158 Name: recyclable_package, dtype: int64
#Total stores in each city which do not provide recyclable packages
number_of_non_recyclable_packages_per_city = (data["recyclable_package"]).groupby("store_city").value_counts()[:,0]
number_of_non_recyclable_packages_per_city
store_city Bellingham 285 Beverly Hills 1871 Bremerton 1480 Los Angeles 1752 Portland 2257 Salem 2444 San Francisco 353 Seattle 2238 Spokane 2004 Tacoma 2546 Name: recyclable_package, dtype: int64
#Plotting the total environmentally sustainable stores in each city
city_analysis = sns.barplot(y=number_of_recyclable_packages_per_city.index, x = number_of_recyclable_packages_per_city.values)
city_analysis.set( ylabel = 'City', xlabel='Number of Stores')
city_analysis.set_title('Total stores providing recyclable packages by city')
plt.show()
Observation: Most stores in the cities of Washington State are environmentally inclined in their packaging usage, barring Bellingham. Furthermore, although only a few cities in Oregon host these retail stores, a significant proportion of them use sustainable bags. Conversely, stores in the urban regions of California make the least use of recyclable bags.
Inference: Stores in Californian cities, particularly San Francisco, can partner with local manufacturers of sustainable packaging to enable economical and efficient mass production. Additionally, they can introduce incentives such as discounts for customers who bring their own eco-friendly bags from home.
Here, we analyse the average profits of stores in various states and cities of the United States and draw a few inferences from the available data.
First, we compute the profit for each record as store sales minus store cost and keep it in a new column.
# yield the profit column
data['Profit'] = data['store_sales(in millions)'] - data['store_cost(in millions)']
Next, we find the average size of the various store types, along with the average shelf space allotted to each section.
# Calculating the average store size by store type
df_StoreSizes = data[['store_type','store_sqft', 'grocery_sqft', 'frozen_sqft', 'meat_sqft']].groupby(by = 'store_type').agg(func = 'mean').round(3)
df_StoreSizes
Initially, let's try to find any relationships between the average profits and any other factors that may influence the profits.
# retrieve variables which may have correlation with profit
df_ProfitVariables = data[['Profit','unit_sales(in millions)', 'store_type', 'store_sqft', 'brand_name', 'grocery_sqft', 'frozen_sqft', 'meat_sqft']]
df_ProfitVariables.corr().round(3)
| Profit | unit_sales(in millions) | store_sqft | grocery_sqft | frozen_sqft | meat_sqft | |
|---|---|---|---|---|---|---|
| Profit | 1.000 | 0.496 | 0.007 | -0.007 | 0.023 | 0.023 |
| unit_sales(in millions) | 0.496 | 1.000 | 0.020 | -0.006 | 0.050 | 0.050 |
| store_sqft | 0.007 | 0.020 | 1.000 | 0.923 | 0.867 | 0.867 |
| grocery_sqft | -0.007 | -0.006 | 0.923 | 1.000 | 0.609 | 0.609 |
| frozen_sqft | 0.023 | 0.050 | 0.867 | 0.609 | 1.000 | 1.000 |
| meat_sqft | 0.023 | 0.050 | 0.867 | 0.609 | 1.000 | 1.000 |
# average profits based on store type
profitByStoreType = df_ProfitVariables.groupby(by = 'store_type').agg(func = 'mean')['Profit']
print('The profit of stores based on the store type: ',profitByStoreType)
The profit of stores based on the store type: store_type Deluxe Supermarket 3.944830 Gourmet Supermarket 3.973857 Small Grocery 2.051111 Supermarket 4.004565 Name: Profit, dtype: float64
def createBarPlot(xArray, yArray, title = None, xLable = None, yLabel = None, xRotation = 60):
'''
Create a formatted vertical barplot.
'''
barPlotObj = sns.barplot(x= xArray, y = yArray, errwidth = 2)
if not title is None:
barPlotObj.set_title(title)
if not xLable is None:
barPlotObj.set(xlabel=xLable)
if not yLabel is None:
barPlotObj.set(ylabel = yLabel)
barPlotObj.set_ylim(top = max(yArray) * 1.1)
for bar, yVal in zip(barPlotObj.patches, yArray):
# add text for each bar
text_x = bar.get_x() + bar.get_width() / 2.0
text_y = bar.get_height()
text = f'{yVal:.3}'
barPlotObj.text(text_x, text_y, text, fontsize=11, ha='center', va='bottom')
plt.xticks(rotation=xRotation)
return barPlotObj
def createBarPlotHorizontal( labelArr,valArr, title = None, xLable = None, yLabel = None):
'''
Create a formatted horizontal barplot.
'''
barPlotObj = sns.barplot(x= valArr, y = labelArr, errwidth = 2)
if not title is None:
barPlotObj.set_title(title)
if not xLable is None:
barPlotObj.set(xlabel=xLable)
if not yLabel is None:
barPlotObj.set(ylabel = yLabel)
# formatting
barPlotObj.set_xlim(0, valArr.values.max() * 1.1)
for bar, yVal in zip(barPlotObj.patches, valArr.values):
# add text for each bar
text_x = bar.get_x() + bar.get_width() + .2
text_y = bar.get_y() + bar.get_height() / 2
text = f'{yVal:.3}'
barPlotObj.text(text_x, text_y, text, fontsize=11, ha='center', va='bottom')
createBarPlotHorizontal(profitByStoreType.index, profitByStoreType, 'Average Profit by Store Type (in million)', 'Store Type', 'Average Profit')
Observation: On average, Small Grocery stores make the least profit, while Supermarkets, Gourmet Supermarkets, and Deluxe Supermarkets earn comparable profits.
Further, let us understand if our analysis still holds for different types of stores in various cities in the United States.
#Average profits in cities based on store type
profitByCityAndStoreType = df_ProfitVariables.groupby(by = ['store_city', 'store_type']).agg(func = 'mean')['Profit'].sort_values(ascending=False)
print(profitByCityAndStoreType)
store_city store_type Portland Supermarket 4.048342 Los Angeles Supermarket 4.002658 Bremerton Supermarket 4.000219 Seattle Supermarket 3.989026 Tacoma Deluxe Supermarket 3.977632 Spokane Supermarket 3.976626 Beverly Hills Gourmet Supermarket 3.973857 Salem Deluxe Supermarket 3.910675 Bellingham Small Grocery 2.084581 San Francisco Small Grocery 2.020719 Name: Profit, dtype: float64
createBarPlotHorizontal(profitByCityAndStoreType.index.values, profitByCityAndStoreType, 'Average Profit by City and Store Type (in million)', 'City', 'Average Profit')
Observation: From the above data, we can observe that Supermarket stores, which have a relatively smaller area than Deluxe Supermarkets, make the highest profits across cities.
Drilling further down into the data, let us verify whether our previous observation still holds at the state level.
profitByStateCityAndStoreType =df_ProfitVariables.groupby(by = ['store_state', 'store_city', 'store_type']).agg(func = 'mean')['Profit'].sort_index(level = 0, ascending= False)
createBarPlotHorizontal(profitByStateCityAndStoreType.index.values, profitByStateCityAndStoreType, \
'Average Profit by State, City, and Store Type (in million)')
Observation: From the data visualization, we can observe that the previous finding, that supermarkets make better profits, still holds at the state level.
The goal here is to analyse store attributes based on location. SRP (suggested retail price) is the price set by manufacturers at which retailers are supposed to sell items in the market. We therefore determine the average price of products in different states.
#calculating the premium products in the stores across various cities in multiple states in the United States
df_store_counts = pd.DataFrame(data.groupby(by = ['store_state','store_city','brand_name']).agg({'SRP':'mean'}))
#sorting and ordering the data
df_store_counts_sorted = df_store_counts['SRP'].groupby(['store_state','store_city'], group_keys=False)
top_store_srp = pd.DataFrame(df_store_counts_sorted.apply(lambda x:x.sort_values(ascending=False).head(5)))
top_store_srp = top_store_srp.reset_index()
top_store_srp.head()
| store_state | store_city | brand_name | SRP | |
|---|---|---|---|---|
| 0 | CA | Beverly Hills | Gauss | 3.289412 |
| 1 | CA | Beverly Hills | James Bay | 3.218182 |
| 2 | CA | Beverly Hills | Colossal | 3.005833 |
| 3 | CA | Beverly Hills | Medalist | 2.963158 |
| 4 | CA | Beverly Hills | Quick | 2.953333 |
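The `sort_values(...).head(5)` lambda above can also be expressed with `nlargest`, which pandas provides for exactly this per-group top-N pattern. A minimal sketch on assumed toy prices:

```python
import pandas as pd

# Toy brand prices (values are assumed for illustration)
toy = pd.DataFrame({
    'store_state': ['CA', 'CA', 'CA', 'WA', 'WA'],
    'brand_name':  ['Gauss', 'Quick', 'Colossal', 'Gauss', 'Quick'],
    'SRP':         [3.2, 2.9, 3.5, 2.1, 2.6],
})

# Top-2 SRP values within each state, without an explicit sort
top2 = (toy.groupby('store_state')['SRP']
           .nlargest(2)
           .reset_index(level=0))
```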
#Visualizing various premium brands at different stores in the united states.
fig = px.bar(top_store_srp, x="brand_name", color="store_state",
y='SRP',
title="",
barmode='group',
height=700,
facet_col="store_state"
)
fig.update_traces(width=1)
fig.show()
Observation - From the graphs, it can be observed that not all brands' products are sold in all states. The average manufacturer-set price also varies across states, and Washington has the highest average SRP across all stores.
Inference - Since prices are highest in Washington, and hence the maximum profits are likely driven from that state, brands could raise their prices in Oregon and California to gain better profitability. Furthermore, some brands need much better marketing and sales strategies to establish themselves amongst the top.
The given dataset records several promotion strategies that customers redeem while shopping. These strategies vary by city and state. Each promotion is associated with the media type used to promote the coupons and the corresponding advertising cost. We analyze which promotion type is popular in each city and, based on city properties, infer which promotion types could be extended to other cities. We have data for three states (Washington, California, and Oregon) and a few cities in each.
# Finding the promotion count based on state and city
count = data.groupby([data.index,'promotion_name']).count()
count.reset_index(inplace = True)
count[['store_state', 'store_city']] = pd.DataFrame(count['level_0'].tolist(), index=count.index)
count = count[['store_state','store_city', 'promotion_name', 'food_category']]
count.rename({'food_category' : 'promotion_count'}, axis = 'columns', inplace = True)
# Here, the count dataframe stores all the count of various promotions used by users across cities.
count.head()
| store_state | store_city | promotion_name | promotion_count | |
|---|---|---|---|---|
| 0 | CA | Beverly Hills | Best Savings | 174 |
| 1 | CA | Beverly Hills | Big Promo | 151 |
| 2 | CA | Beverly Hills | Big Time Savings | 80 |
| 3 | CA | Beverly Hills | Bye Bye Baby | 115 |
| 4 | CA | Beverly Hills | Double Down Sale | 127 |
# Finding the most-used promotion
count.loc[count['promotion_count'].idxmax()]
store_state OR store_city Salem promotion_name Cash Register Lottery promotion_count 1555 Name: 82, dtype: object
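The count pipeline above borrows the `food_category` column just to hold counts; `groupby(...).size()` with `reset_index(name=...)` produces the same result directly. A minimal sketch on assumed toy records:

```python
import pandas as pd

# Toy purchase records (values are assumed for illustration)
toy = pd.DataFrame({
    'store_state': ['CA', 'CA', 'CA', 'OR'],
    'store_city': ['Beverly Hills', 'Beverly Hills', 'Beverly Hills', 'Salem'],
    'promotion_name': ['Big Promo', 'Big Promo', 'Best Savings', 'Big Promo'],
})

# size() counts rows per group; name= labels the count column directly
count = (toy.groupby(['store_state', 'store_city', 'promotion_name'])
            .size()
            .reset_index(name='promotion_count'))
```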
# Plotting a sunburst chart of promotion types across states and cities
fig = px.sunburst(count, path=['store_state', 'store_city', 'promotion_name'], values='promotion_count',
hover_data=['promotion_count'])
fig.update_layout(
autosize=False,
width=800,
height=800,
title="Promotions used across States and Cities<br><sup>(Hover over the components to access the promotion count)</sup>")
fig.show()
The hierarchical sunburst chart above shows that promotions are most heavily used in Washington, followed by Oregon and then California. The following table shows which promotion types are most frequently used in various cities:
| Sr No. | State | Cities | Promotion Type | Promotion Count |
|---|---|---|---|---|
| 1 | Oregon | Salem | Cash Register Lottery | 1555 |
| 2 | Washington | Seattle | Weekend Markdown | 904 |
| 3 | Oregon | Portland | Free For All | 645 |
| 4 | California | Los Angeles | High Roller Savings | 639 |
| 5 | Washington | Tacoma | Super Duper Savers | 622 |
| 6 | Washington | Spokane | Sales Days | 512 |
| 7 | California | Beverly Hills | Two Day Sale | 482 |
| 8 | Washington | Bremerton | Shelf Clearing Days | 445 |
| 9 | California | San Francisco | Savings Galore | 86 |
| 10 | Washington | Bellingham | Coupon Spectacular | 64 |
Inference: Cash Register Lottery is the most-used promotion in Salem, yet it is not used in any other city. Seattle, Tacoma, and Spokane, for example, have large numbers of promotion-redeeming customers, but no Cash Register Lottery; extending this promotion to those cities could make their stores more profitable. The same holds the other way round: Weekend Markdown could be pushed to cities in Oregon and California. All leading promotion types can be applied to other cities so that customers redeem maximum savings and stores make maximum profits.
This dataset contains sufficient features to analyze customers and their shopping preferences; for example, which customers (with or without kids) are likely to buy a certain store membership. By looking at the relevant features, we will derive insights and inferences on customer behavior. We will also extend this analysis by probing two pairs of relationships: one between food category and store location, the other between food category and marital status. Here, we analyze how consumer food preferences vary by location, which food types are popular, and which food categories can be promoted, based on customer marital status, to increase sales.
#Defining a function to determine whether a customer has children
def children(value):
if value >= 1:
return "Y"
else:
return "N"
#Creating a new dataframe column for above observation
data['children'] = data['total_children'].map(children)
display(data.head())
| food_category | food_department | food_family | store_sales(in millions) | store_cost(in millions) | unit_sales(in millions) | promotion_name | marital_status | gender | total_children | ... | meat_sqft | coffee_bar | video_store | salad_bar | prepared_food | florist | media_type | cost | Profit | children | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| store_state | store_city | |||||||||||||||||||||
| WA | Bellingham | Breakfast Foods | Frozen Foods | Food | 3.28 | 1.3776 | 2 | Coupon Spectacular | S | F | 1 | ... | 2374 | 1 | 0 | 0 | 0 | 0 | Daily Paper | 90.58 | 1.9024 | Y |
| Bellingham | Breakfast Foods | Frozen Foods | Food | 1.36 | 0.4488 | 1 | Super Savers | M | F | 4 | ... | 2374 | 1 | 0 | 0 | 0 | 0 | TV | 114.98 | 0.9112 | Y | |
| Bellingham | Breakfast Foods | Frozen Foods | Food | 1.87 | 0.6171 | 1 | Sales Galore | M | M | 1 | ... | 2374 | 1 | 0 | 0 | 0 | 0 | Daily Paper, Radio | 109.03 | 1.2529 | Y | |
| Bellingham | Bread | Baked Goods | Food | 3.93 | 1.8078 | 1 | Double Your Savings | S | M | 4 | ... | 2374 | 1 | 0 | 0 | 0 | 0 | Street Handout | 122.16 | 2.1222 | Y | |
| Bellingham | Bread | Baked Goods | Food | 1.06 | 0.5194 | 2 | Wallet Savers | M | F | 3 | ... | 2374 | 1 | 0 | 0 | 0 | 0 | TV | 139.36 | 0.5406 | Y |
5 rows × 34 columns
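The `children` helper reduces to a one-line vectorized mapping. A minimal sketch on assumed toy values:

```python
import pandas as pd

# Toy total_children values (assumed for illustration)
total_children = pd.Series([0, 1, 3])

# Same Y/N flag as the children() function, written as a lambda
children = total_children.map(lambda v: 'Y' if v >= 1 else 'N')
```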
# getting the count of the marital status
data.marital_status.value_counts()
S 19793 M 19099 Name: marital_status, dtype: int64
# getting the count of the gender value
data.gender.value_counts()
M 19692 F 19200 Name: gender, dtype: int64
# getting the count of the customers with and without kids
data.children.value_counts()
Y 35307 N 3585 Name: children, dtype: int64
# getting the count of number of kids
data['total_children'].value_counts()
1 8015 4 7982 2 7875 3 7730 5 3705 0 3585 Name: total_children, dtype: int64
#To obtain new dataframe for demographics
data_customer_behavior = data.groupby(by=['gender','marital_status','children'])
#Detailed overview of each demographic group
for name, group in data_customer_behavior:
print(name, len(group))
('F', 'M', 'N') 910
('F', 'M', 'Y') 8700
('F', 'S', 'N') 713
('F', 'S', 'Y') 8877
('M', 'M', 'N') 932
('M', 'M', 'Y') 8557
('M', 'S', 'N') 1030
('M', 'S', 'Y') 9173
#creating a dictionary
demographic_dictionary = {'Category':[], 'Occurence':[]}
#Appending the dictionary
for name,group in data_customer_behavior:
demographic_dictionary['Category'].append(str(name))
demographic_dictionary['Occurence'].append(len(group))
#Creating a dataframe for visualization
df_of_customer = pd.DataFrame(demographic_dictionary)
df_of_customer
| Category | Occurence | |
|---|---|---|
| 0 | ('F', 'M', 'N') | 910 |
| 1 | ('F', 'M', 'Y') | 8700 |
| 2 | ('F', 'S', 'N') | 713 |
| 3 | ('F', 'S', 'Y') | 8877 |
| 4 | ('M', 'M', 'N') | 932 |
| 5 | ('M', 'M', 'Y') | 8557 |
| 6 | ('M', 'S', 'N') | 1030 |
| 7 | ('M', 'S', 'Y') | 9173 |
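The dictionary-building loop above can likewise be replaced by `groupby(...).size()`. A minimal sketch on assumed toy rows (keeping the notebook's `Occurence` column name):

```python
import pandas as pd

# Toy customer rows (values are assumed for illustration)
toy = pd.DataFrame({
    'gender': ['F', 'F', 'M'],
    'marital_status': ['M', 'M', 'S'],
    'children': ['Y', 'Y', 'Y'],
})

# One row per (gender, marital_status, children) combination,
# with the group size as the Occurence column
df_counts = (toy.groupby(['gender', 'marital_status', 'children'])
                .size()
                .reset_index(name='Occurence'))
```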
#creating an interactive plot
fig = px.bar(df_of_customer, x="Category", y="Occurence", color="Occurence", barmode="group", title="Distribution of customers based on number of children and marital status <br> (F=Female, M=Male, N=No children, Y=At least one child, S=Single)")
fig.show()
Observation - From the plot, it can be observed that irrespective of marital status and gender, the largest segment of visiting customers is that of customers with children.
Furthermore, there is a large difference between customers with children and customers without. Among the segments of customers with children, married customers are fewer than single ones.
Inference - Three variables were used for this demographic analysis: gender, marital status, and presence of children.
It can be inferred that the retail chain hosts more customers with children, and business strategies should be shaped by this observation. It is among the most important insights a retail chain can have, since children have considerable buying power and also influence their parents' purchases.
In the following snippets we examine whether any customer feature can be predicted from another set of customer features. The first feature we wish to predict is the customer's yearly income. Predicting yearly income is of great commercial value because such a model makes it easy to group customers into income bands, to which customized and more effective promotion strategies can be applied. For example, at Macy's, when a customer applies for an exclusive store membership, the customer's financial profile is reviewed to decide eligibility. Similarly, predicting yearly income from customer features will help management simplify data collection and better assess customer eligibility for exclusive membership.
def dfMultiMergeByIndex(dfArr):
    '''
    merge multiple dataframes based on their index columns
    '''
    isFirst = True
    res = None
    for df in dfArr:
        if isFirst:
            res = df
            isFirst = False
        else:
            res = pd.merge(res, df, right_index=True, left_index=True)
    return res
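The same index-aligned merge can be written more compactly by folding the list of frames with `functools.reduce`; a minimal sketch on toy frames (the frames here are illustrative stand-ins, not our customer data):

```python
from functools import reduce
import pandas as pd

# toy frames sharing the same index, as in our customer data
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})
c = pd.DataFrame({'z': [5, 6]})

# fold the list of frames into one index-aligned merge
merged = reduce(lambda l, r: pd.merge(l, r, left_index=True, right_index=True), [a, b, c])
print(merged.shape)  # (2, 3)
```

Either form produces the same result; the explicit loop above just makes the first-frame initialization visible.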
# fetching only customer related features
customerData = data.iloc[:, 7:17].reset_index(drop= True) # retrieve all the columns that represent customer features
customerData.head()
| marital_status | gender | total_children | education | member_card | occupation | houseowner | avg_cars_at home(approx) | avg. yearly_income | num_children_at_home | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S | F | 1 | Bachelors Degree | Bronze | Professional | Y | 2 | $50K - $70K | 0 |
| 1 | M | F | 4 | High School Degree | Bronze | Manual | Y | 4 | $30K - $50K | 4 |
| 2 | M | M | 1 | Partial High School | Normal | Skilled Manual | Y | 1 | $10K - $30K | 0 |
| 3 | S | M | 4 | Graduate Degree | Bronze | Professional | N | 2 | $70K - $90K | 0 |
| 4 | M | F | 3 | Bachelors Degree | Golden | Management | Y | 2 | $50K - $70K | 3 |
# convert binary vars into 0-1 vars
customerData.marital_status = customerData.marital_status.map(lambda v : 0 if v =='S' else 1)
customerData.gender = customerData.gender.map(lambda v : 0 if v =='F' else 1)
customerData.houseowner = customerData.houseowner.map(lambda v : 0 if v =='N' else 1)
# convert categorical vars into dummy vars
eduDummy = pd.get_dummies(customerData.education)
memDummy = pd.get_dummies(customerData.member_card)
ocuDummy = pd.get_dummies(customerData.occupation)
yearlyIncome = customerData['avg. yearly_income']
customerDataKNN = customerData.drop(columns = ['education', 'member_card', 'occupation', 'avg. yearly_income'])
customerDataKNN = dfMultiMergeByIndex([customerDataKNN, eduDummy, memDummy, ocuDummy])
customerDataKNN.head(5)
| marital_status | gender | total_children | houseowner | avg_cars_at home(approx) | num_children_at_home | Bachelors Degree | Graduate Degree | High School Degree | Partial College | Partial High School | Bronze | Golden | Normal | Silver | Clerical | Management | Manual | Professional | Skilled Manual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 4 | 1 | 4 | 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 1 | 4 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 3 | 1 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
xArr = customerDataKNN
yArr = yearlyIncome
# split the dataset into training and testing subsets
xTrainKNN, xTestKNN, yTrainKNN, yTestKNN = train_test_split(
xArr.values, yArr.values, random_state=11)
# defining the KNN classifier for yearly income prediction
knnEstimatorIncome = KNeighborsClassifier()
knnEstimatorIncome.fit(X=xTrainKNN, y=yTrainKNN)
KNeighborsClassifier()
predictedKNN = knnEstimatorIncome.predict(X=xTestKNN)
expectedKNN = yTestKNN
# getting the accuracy of the prediction model
print(f'The accuracy of the prediction model is: {knnEstimatorIncome.score(xTestKNN, yTestKNN):.2%}')
The accuracy of the prediction model is: 90.80%
predAndExpected = [(x,y) for x,y in zip(predictedKNN, expectedKNN)]
sampleResult = random.sample(predAndExpected, 10)
# running the prediction model on the testing data
print('Testing the prediction model on the testing data -')
for x,y in sampleResult:
print(f'predicted: {x}' , ' \t', f'expected: {y}')
Testing the prediction model on the testing data -
predicted: $50K - $70K 	 expected: $50K - $70K
predicted: $70K - $90K 	 expected: $70K - $90K
predicted: $70K - $90K 	 expected: $50K - $70K
predicted: $70K - $90K 	 expected: $70K - $90K
predicted: $10K - $30K 	 expected: $10K - $30K
predicted: $50K - $70K 	 expected: $50K - $70K
predicted: $10K - $30K 	 expected: $10K - $30K
predicted: $50K - $70K 	 expected: $50K - $70K
predicted: $30K - $50K 	 expected: $30K - $50K
predicted: $30K - $50K 	 expected: $50K - $70K
yearlyIncome.value_counts().plot(kind = 'bar', figsize = (10,6), rot = 0, title = 'Histogram of Customer Income Group', xlabel = 'Annual income group', ylabel = 'Number of customers')
<AxesSubplot:title={'center':'Histogram of Customer Income Group'}, xlabel='Annual income group', ylabel='Number of customers'>
The model achieves a relatively high accuracy, so it is a valid model. Given this, we can conclude that a customer's yearly income can be predicted from his/her marital status, gender, total children, education, member card, occupation, home ownership, average cars at home, and number of children at home.
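A single train/test split can overstate accuracy, so k-fold cross-validation gives a more robust estimate of how the KNN model generalizes. A minimal sketch, using randomly generated data as a stand-in for our feature matrix and income labels (the shapes and label values are illustrative only):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(11)
# synthetic stand-ins for customerDataKNN and yearlyIncome
X_demo = rng.integers(0, 2, size=(500, 20))
y_demo = rng.choice(['$10K - $30K', '$30K - $50K', '$50K - $70K'], size=500)

# 5-fold cross-validated accuracy of the same KNN estimator
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=5)
print(f'mean accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})')
```

If the cross-validated mean on the real data stayed close to the 90.80% single-split figure, that would strengthen our confidence in the model.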
Suggestion to the management: Based on this model, the management could understand customers well before having information on their annual income. Using this, they can target customers with higher yearly income with exclusive features like free home delivery, same-day home delivery, etc. By applying this strategy, management will be in a position to lock in these customers for a longer period once they buy the exclusive subscription.
To display our data on a three-dimensional graph, we perform principal component analysis to reduce the high-dimensional input data into a set of three-dimensional vectors. We then plot scatter plots for both the predicted and the expected values.
priCompAnalyzer3D = decomposition.PCA(3)
priCompAnalyzer3D.fit(xTrainKNN)
PCA(n_components=3)
xTestKNN3D = priCompAnalyzer3D.transform(xTestKNN)
xTestKNN3D.shape
(9723, 3)
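Before trusting the 3-D projection, it is worth checking how much of the original variance the three components retain, via `explained_variance_ratio_`. A sketch on random data (a stand-in for our 20-feature customer matrix, which is not reproduced here):

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))  # stand-in for the customer feature matrix

pca = decomposition.PCA(3)
pca.fit(X_demo)
# fraction of total variance captured by the 3 retained components
retained = pca.explained_variance_ratio_.sum()
print(f'variance retained by 3 components: {retained:.1%}')
```

A low retained fraction would mean the 3-D scatter plots only show a partial picture of the separation between income groups.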
knnResult3D = pd.DataFrame(xTestKNN3D, columns = ['x', 'y', 'z'])
knnResult3D['predicted'] = predictedKNN
knnResult3D['expected'] = expectedKNN
incomeGroupToColor = {income : color for income, color in zip(np.unique(knnResult3D.predicted), ['c', 'b', 'g', 'r', 'm', 'y', 'k', 'w'])}
knnResult3D['colorPredicted'] = knnResult3D.predicted.map(incomeGroupToColor)
knnResult3D['colorExpected'] = knnResult3D.expected.map(incomeGroupToColor)
rowSampler = random.sample([i for i in range(len(knnResult3D))], 1000)
knnResult3DSample = knnResult3D.iloc[rowSampler, :]
px.scatter_3d(data_frame= knnResult3DSample, x = 'x', y = 'y', z = 'z', color= 'predicted', title = 'Predicted Customer Income Group')
px.scatter_3d(data_frame= knnResult3DSample, x = 'x', y = 'y', z = 'z', color= 'expected', title = 'Expected Customer Income Group')
In order to predict the cost of acquiring customers, we look at the different kinds of customers we have and use their features as predictors. Our hope is to find some dependence of the customer acquisition cost on the various features of the customers. These features include their marital status, gender, the number of children they have, their occupation, yearly income range, education level, whether they own a home, and what kind of loyalty card they have.
By looking at these features, we hypothesize that there may be some relation between them and the corresponding customer acquisition cost. At the moment, we do not know whether such a relation exists and, if it does, how it impacts our costs. Since we are focusing exclusively on customer details, we will filter our original dataframe to keep only the variables relevant to customer details. Following that, we will create dummy variables for each categorical variable. This means that, for each category of a categorical variable, there will be a column indicating whether the row belongs to that category or not.
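As a small illustration of the dummy-variable step, `pd.get_dummies` with `drop_first=True` turns a k-category column into k-1 indicator columns (the dropped first category becomes the baseline):

```python
import pandas as pd

toy = pd.DataFrame({'member_card': ['Bronze', 'Normal', 'Golden', 'Bronze']})
# 'Bronze' sorts first and is dropped; the remaining categories get 0/1 columns
dummies = pd.get_dummies(toy, columns=['member_card'], drop_first=True)
print(list(dummies.columns))  # ['member_card_Golden', 'member_card_Normal']
```

Dropping the first level avoids perfect collinearity among the dummies, which matters for the OLS model below.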
We then attempt to predict the customer acquisition costs by creating three regression models. The three models are:
1) Ordinary Least Squares or Linear Regression
2) Decision Tree Regressor
3) XGBoost Regressor
#creating a list of customer features
customer_columns = ['marital_status', 'gender',
'total_children', 'education', 'occupation',
'houseowner', 'avg_cars_at home(approx)', 'avg. yearly_income', 'member_card', 'cost']
#creating a new data frame for customers
df_customer = data[customer_columns].copy(deep= True)
#identifying categorical variables for making dummy variables
categorical_columns = ['marital_status', 'gender', 'education', 'occupation',
'houseowner', 'avg. yearly_income', 'member_card']
#showing all categories of all categorical variables
val_dict = {}
for col in categorical_columns:
val_dict.update({col : df_customer[col].value_counts().index})
val_dict
{'marital_status': Index(['S', 'M'], dtype='object'),
'gender': Index(['M', 'F'], dtype='object'),
'education': Index(['High School Degree', 'Partial High School', 'Bachelors Degree',
'Partial College', 'Graduate Degree'],
dtype='object'),
'occupation': Index(['Professional', 'Skilled Manual', 'Manual', 'Management', 'Clerical'], dtype='object'),
'houseowner': Index(['Y', 'N'], dtype='object'),
'avg. yearly_income': Index(['$30K - $50K', '$10K - $30K', '$50K - $70K', '$70K - $90K',
'$130K - $150K', '$110K - $130K', '$90K - $110K', '$150K +'],
dtype='object'),
'member_card': Index(['Bronze', 'Normal', 'Golden', 'Silver'], dtype='object')}
#creating dummy variables for each categorical variable
df_customer = pd.get_dummies(df_customer, columns= categorical_columns, drop_first= True)
df_customer.head()
| total_children | avg_cars_at home(approx) | cost | marital_status_S | gender_M | education_Graduate Degree | education_High School Degree | education_Partial College | education_Partial High School | occupation_Management | ... | avg. yearly_income_$110K - $130K | avg. yearly_income_$130K - $150K | avg. yearly_income_$150K + | avg. yearly_income_$30K - $50K | avg. yearly_income_$50K - $70K | avg. yearly_income_$70K - $90K | avg. yearly_income_$90K - $110K | member_card_Golden | member_card_Normal | member_card_Silver | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| store_state | store_city | |||||||||||||||||||||
| WA | Bellingham | 1 | 2 | 90.58 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Bellingham | 4 | 4 | 114.98 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| Bellingham | 1 | 1 | 109.03 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | |
| Bellingham | 4 | 2 | 122.16 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | |
| Bellingham | 3 | 2 | 139.36 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
5 rows × 24 columns
#Seperating the independent and dependent variables
X = df_customer.drop(columns= ['cost'])
X = sm.add_constant(X) #adding a constant to the independent variable to get intercept of linear regression line
y = df_customer['cost']
#Linear Regression / OLS model
ols_model = sm.OLS(y, X)
ols_results = ols_model.fit()
ols_results.summary()
| Dep. Variable: | cost | R-squared: | 0.003 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.003 |
| Method: | Least Squares | F-statistic: | 5.930 |
| Date: | Mon, 05 Dec 2022 | Prob (F-statistic): | 4.70e-18 |
| Time: | 19:57:36 | Log-Likelihood: | -1.8795e+05 |
| No. Observations: | 38892 | AIC: | 3.760e+05 |
| Df Residuals: | 38868 | BIC: | 3.762e+05 |
| Df Model: | 23 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 102.2828 | 1.645 | 62.171 | 0.000 | 99.058 | 105.507 |
| total_children | -0.0619 | 0.108 | -0.574 | 0.566 | -0.273 | 0.149 |
| avg_cars_at home(approx) | 0.8919 | 0.172 | 5.179 | 0.000 | 0.554 | 1.229 |
| marital_status_S | 0.6103 | 0.330 | 1.848 | 0.065 | -0.037 | 1.257 |
| gender_M | -0.5850 | 0.310 | -1.890 | 0.059 | -1.192 | 0.022 |
| education_Graduate Degree | -0.2079 | 0.798 | -0.261 | 0.794 | -1.771 | 1.355 |
| education_High School Degree | -3.3260 | 0.688 | -4.834 | 0.000 | -4.675 | -1.977 |
| education_Partial College | -0.5647 | 0.732 | -0.772 | 0.440 | -1.999 | 0.870 |
| education_Partial High School | -2.1937 | 0.730 | -3.006 | 0.003 | -3.624 | -0.763 |
| occupation_Management | -2.5200 | 1.411 | -1.786 | 0.074 | -5.286 | 0.246 |
| occupation_Manual | -2.6514 | 1.419 | -1.868 | 0.062 | -5.433 | 0.130 |
| occupation_Professional | -3.9637 | 1.391 | -2.849 | 0.004 | -6.691 | -1.237 |
| occupation_Skilled Manual | -2.6412 | 1.417 | -1.865 | 0.062 | -5.418 | 0.135 |
| houseowner_Y | 0.6220 | 0.336 | 1.853 | 0.064 | -0.036 | 1.280 |
| avg. yearly_income_$110K - $130K | -2.3128 | 1.248 | -1.853 | 0.064 | -4.759 | 0.134 |
| avg. yearly_income_$130K - $150K | -2.2188 | 1.208 | -1.836 | 0.066 | -4.587 | 0.150 |
| avg. yearly_income_$150K + | 0.3822 | 1.510 | 0.253 | 0.800 | -2.578 | 3.343 |
| avg. yearly_income_$30K - $50K | -0.6120 | 0.890 | -0.688 | 0.491 | -2.356 | 1.132 |
| avg. yearly_income_$50K - $70K | -1.8788 | 0.980 | -1.917 | 0.055 | -3.800 | 0.042 |
| avg. yearly_income_$70K - $90K | -2.9283 | 1.061 | -2.760 | 0.006 | -5.008 | -0.849 |
| avg. yearly_income_$90K - $110K | 1.6155 | 1.262 | 1.280 | 0.201 | -0.858 | 4.089 |
| member_card_Golden | -0.2953 | 0.529 | -0.559 | 0.576 | -1.331 | 0.741 |
| member_card_Normal | -0.7922 | 0.683 | -1.161 | 0.246 | -2.130 | 0.546 |
| member_card_Silver | -1.2319 | 0.580 | -2.124 | 0.034 | -2.369 | -0.095 |
| Omnibus: | 836571.603 | Durbin-Watson: | 1.854 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 2941.472 |
| Skew: | 0.005 | Prob(JB): | 0.00 |
| Kurtosis: | 1.653 | Cond. No. | 77.5 |
#creating a function for VIF analysis and variable selection
def get_vif(train_dataframe):
vif_dict, tolerance_dict = {}, {}
# create formula for each exogenous variable
for col in train_dataframe.drop(columns = ['const']) :
y = train_dataframe[col]
X = train_dataframe.drop(columns = [col])
# extract r-squared from the fit
r_squared = sm.OLS(y,X).fit().rsquared
# calculate VIF
vif = 1/(1 - r_squared)
vif_dict[col] = vif
# calculate tolerance
tolerance = 1 - r_squared
tolerance_dict[col] = tolerance
# return VIF DataFrame
df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})
return df_vif
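statsmodels also ships a built-in `variance_inflation_factor` that computes the same quantity column by column; a minimal sketch on a deliberately collinear toy frame (the column names are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
demo = pd.DataFrame({'a': rng.normal(size=100)})
demo['b'] = demo['a'] * 0.9 + rng.normal(scale=0.1, size=100)  # highly collinear with 'a'
demo['c'] = rng.normal(size=100)                               # independent column

# VIF for each column: regress it on the others, VIF = 1 / (1 - R^2)
vifs = [variance_inflation_factor(demo.values, i) for i in range(demo.shape[1])]
print(dict(zip(demo.columns, np.round(vifs, 2))))
```

As expected, the collinear columns get large VIFs while the independent column stays near 1, which is the same pattern our `get_vif` helper flags for the occupation dummies.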
get_vif(X)
| VIF | Tolerance | |
|---|---|---|
| total_children | 1.076095 | 0.929286 |
| avg_cars_at home(approx) | 1.602988 | 0.623835 |
| marital_status_S | 1.147374 | 0.871556 |
| gender_M | 1.008868 | 0.991210 |
| education_Graduate Degree | 1.337053 | 0.747913 |
| education_High School Degree | 4.168448 | 0.239897 |
| education_Partial College | 1.823190 | 0.548489 |
| education_Partial High School | 4.687363 | 0.213340 |
| occupation_Management | 10.426888 | 0.095906 |
| occupation_Manual | 15.845654 | 0.063109 |
| occupation_Professional | 18.034810 | 0.055448 |
| occupation_Skilled Manual | 16.212233 | 0.061682 |
| houseowner_Y | 1.129459 | 0.885380 |
| avg. yearly_income_$110K - $130K | 2.645354 | 0.378021 |
| avg. yearly_income_$130K - $150K | 3.457997 | 0.289185 |
| avg. yearly_income_$150K + | 1.989750 | 0.502576 |
| avg. yearly_income_$30K - $50K | 7.280988 | 0.137344 |
| avg. yearly_income_$50K - $70K | 5.698303 | 0.175491 |
| avg. yearly_income_$70K - $90K | 5.134349 | 0.194767 |
| avg. yearly_income_$90K - $110K | 2.661621 | 0.375711 |
| member_card_Golden | 1.257513 | 0.795221 |
| member_card_Normal | 3.556942 | 0.281140 |
| member_card_Silver | 1.130706 | 0.884403 |
ols_predictions = ols_results.fittedvalues
#plotting actual vs predicted values
fig = sns.scatterplot(x = df_customer['cost'], y= ols_predictions)
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.title('Actual vs Predicted values of OLS model')
Text(0.5, 1.0, 'Actual vs Predicted values of OLS model')
The Ordinary Least Squares model is the simplest model we have used, and unfortunately, it performed worse than we expected. The results show an R-squared value of 0.003, a value that should ideally be close to 1. This means the model does a very poor job of explaining the variation in the data. Several other metrics, like the Omnibus test and the F-statistic, are also far from their ideal values.
This means that a model as simple as linear regression won't work for our data. This is probably because there are a lot of categorical variables in the data, and linear regression works best with numerical data. Another reason for such a low score may be high correlation between some predictor variables. To check for this, we run a VIF analysis and see that the 'occupation' dummy variables have very large VIF scores. However, even after dropping that variable, there is no difference in the results, so we have chosen not to show the second iteration of the OLS model.
In hopes of improving our model, we now move to the DecisionTree Regressor, which is a bit more complex than Linear Regression.
#dropping the additional 'const' column
X.drop(columns = ['const'], inplace= True)
X.head()
| total_children | avg_cars_at home(approx) | marital_status_S | gender_M | education_Graduate Degree | education_High School Degree | education_Partial College | education_Partial High School | occupation_Management | occupation_Manual | ... | avg. yearly_income_$110K - $130K | avg. yearly_income_$130K - $150K | avg. yearly_income_$150K + | avg. yearly_income_$30K - $50K | avg. yearly_income_$50K - $70K | avg. yearly_income_$70K - $90K | avg. yearly_income_$90K - $110K | member_card_Golden | member_card_Normal | member_card_Silver | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| store_state | store_city | |||||||||||||||||||||
| WA | Bellingham | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Bellingham | 4 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | |
| Bellingham | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | |
| Bellingham | 4 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | |
| Bellingham | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
5 rows × 23 columns
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25)
#DecisionTree Regressor
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(criterion= 'squared_error')  # 'mse' was renamed 'squared_error' in newer scikit-learn
dt_reg.fit(X_train, y_train)
DecisionTreeRegressor(criterion='squared_error')
#making an array of predictions
y_pred = dt_reg.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error
#printing the metrics of DecisionTree model
print(f'R2 score: {r2_score(y_test, y_pred): 0.3f}')
print(f'MSE: {mean_squared_error(y_test, y_pred): 0.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)): 0.3f}')
R2 score: 0.250 MSE: 698.290 RMSE: 26.425
#plotting actual vs predicted model
sns.scatterplot(x = y_test, y= y_pred)
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.title('Actual vs Predicted values of DecisionTree Regressor')
Text(0.5, 1.0, 'Actual vs Predicted values of DecisionTree Regressor')
The performance of the Decision Tree model is not as good as we would have liked, but it is substantially better than the OLS model: the R-squared value improves from 0.003 to 0.250. A close look at the scatter plot of actual vs. predicted values reveals a straight line going through the origin, with many other points covering it. The other error metrics, Mean Squared Error and Root Mean Squared Error, are also tolerable.
An RMSE of about 26.4 means that the average difference between the actual acquisition cost and the cost predicted by the model is around $26. For a grocery chain running on thin margins, this difference is quite large. We need to find ways to improve the quality of the predictions further, so we move to a slightly more complex model - XGBoost.
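One way to judge whether an RMSE of ~$26 is genuinely useful is to compare it against a naive baseline that always predicts the mean training cost. A sketch on synthetic costs (the mean and spread here are illustrative stand-ins for our `cost` column):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
y_train_demo = rng.normal(loc=100, scale=30, size=1000)  # stand-in for training costs
y_test_demo = rng.normal(loc=100, scale=30, size=250)    # stand-in for test costs

# baseline that ignores features and always predicts the training mean
baseline = DummyRegressor(strategy='mean').fit(np.zeros((1000, 1)), y_train_demo)
baseline_rmse = np.sqrt(mean_squared_error(y_test_demo, baseline.predict(np.zeros((250, 1)))))
print(f'mean-baseline RMSE: {baseline_rmse:.2f}')
```

A model is only worth deploying if its RMSE clearly beats this mean-only baseline on the same test split.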
#XGBoost Regressor
from xgboost import XGBRegressor
import xgboost as xgb
xgb_model = XGBRegressor()
xgb_model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=16,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
#Plotting 10 most important factors in predicting cost as used by the XGB Regressor model
xgb_fig = xgb.plot_importance(xgb_model, max_num_features= 10)
#creating an array for predictions
y_pred = xgb_model.predict(X_test)
#printing metrics of XGB Regressor model
print(f'R2 score: {r2_score(y_test, y_pred): 0.2f}')
print(f'MSE: {mean_squared_error(y_test, y_pred): 0.2f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)): 0.2f}')
R2 score: 0.17 MSE: 772.32 RMSE: 27.79
#plotting actual vs predicted values
sns.scatterplot(x = y_test, y = y_pred)
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.title('Actual vs Predicted values of XGBRegressor')
Text(0.5, 1.0, 'Actual vs Predicted values of XGBRegressor')
Surprisingly, the Decision Tree model performed slightly better than the XGBoost model. The R-squared value for the latter is 0.17, and the MSE and RMSE are 772.32 and 27.79. The inner mathematics of XGBoost are quite complex; however, we can visualize the importance the model gave to each of our features, in case an insight can be drawn from them. The plot shows that the model gave the most importance to the 'total_children' variable, which aligns with our previous finding that our core customer segment is people with children. This suggests there may really be a strong relation between the number of children our customers have and the customer acquisition cost.
However, in the end, we would choose the DecisionTree Regressor model if we had to make predictions using this exact dataset.
Although all models performed differently and gave us different results, none of them were as good as we would have liked them to be. Our best working model is still $26 off in predicting the cost of customer acquisition, which is quite a lot. In order to improve the performance of our models, we can take a few steps, some of which are listed below:
1) We can change the proportion of the training and testing subsets of our data.
2) We can use cross-validation to better test the performance of our models on new data.
3) We can tune the hyper-parameters of the models, since we have only used the default values.
4) Most importantly, we can urge management to focus on expanding the dataset to include more customer features relevant to our use case, such as time spent in the store, aisles visited, and how customers respond to various promotional deals. This will provide better insights into customer behaviour and make predicting customer acquisition costs easier.
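Point 3 above, hyper-parameter tuning, can be sketched with `GridSearchCV` over the decision tree's depth and leaf size. The grid values and the synthetic data below are illustrative only, not tuned for our dataset:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 5))                 # stand-in features
y_demo = X_demo[:, 0] * 10 + rng.normal(size=300)  # stand-in acquisition cost

# cross-validated search over tree complexity, scored by RMSE
param_grid = {'max_depth': [3, 5, 10], 'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                      cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_demo, y_demo)
print(search.best_params_)
```

On the real data, the best parameters found this way would replace the defaults used in our DecisionTree Regressor above, and the cross-validated RMSE would give a fairer estimate than a single split.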